Case Study

New York Police Department cares about citizens of New York and NYPD hired Data Scientist, who will provide answer for following question:

How can we improve New York City in order to reduce vehicle collsions and improve safety on roads?

Answer can be provided with comprehensive data analysis which enables to understand main factors that cause vehicle crashes.

Aim of this project is to investigate main contributions that have an affect on the safety on roads by providing answers for following questions:

1. When collisions happen the most frequently? Is it correlated with specific months, weekdays or hours?

2. Where collisions appear the most frequently? Which spots are the most dangerous ones?

3. What are the main factors that contribute collisions in NYC?

dataNYC <- readRDS("data_NYC.rds")
glimpse(dataNYC)
## Observations: 1,185,165
## Variables: 30
## $ borough                       <chr> "QUEENS", NA, "MANHATTAN", "QUEE...
## $ contributing_factor_vehicle_1 <chr> "Fatigued/Drowsy", "Driver Inatt...
## $ contributing_factor_vehicle_2 <chr> "Unspecified", "Unspecified", "U...
## $ date                          <dttm> 2012-10-21, 2012-10-21, 2012-10...
## $ latitude                      <chr> "40.6869122", "40.8493533", "40....
## $ location.type                 <chr> "Point", "Point", "Point", "Poin...
## $ location.coordinates          <list> [<-73.79437, 40.68691>, <-73.87...
## $ longitude                     <chr> "-73.7943714", "-73.8711899", "-...
## $ number_of_cyclist_injured     <chr> "0", "0", "0", "0", "0", "1", "0...
## $ number_of_cyclist_killed      <chr> "0", "0", "0", "0", "0", "0", "0...
## $ number_of_motorist_injured    <chr> "2", "3", "0", "0", "0", "0", "0...
## $ number_of_motorist_killed     <chr> "0", "0", "0", "0", "0", "0", "0...
## $ number_of_pedestrians_injured <chr> "0", "0", "0", "0", "0", "0", "2...
## $ number_of_pedestrians_killed  <chr> "0", "0", "0", "0", "0", "0", "0...
## $ number_of_persons_injured     <chr> "2", "3", "0", "0", "0", "1", "2...
## $ number_of_persons_killed      <chr> "0", "0", "0", "0", "0", "0", "0...
## $ off_street_name               <chr> "113 AVENUE                     ...
## $ on_street_name                <chr> "SUTPHIN BOULEVARD              ...
## $ time                          <chr> "11:30", "11:30", "11:50", "11:5...
## $ unique_key                    <chr> "268825", "2929446", "52928", "2...
## $ vehicle_type_code1            <chr> "SPORT UTILITY / STATION WAGON",...
## $ vehicle_type_code2            <chr> "SPORT UTILITY / STATION WAGON",...
## $ zip_code                      <chr> "11435", NA, "10023", "11377", N...
## $ contributing_factor_vehicle_3 <chr> NA, "Unspecified", NA, NA, NA, N...
## $ vehicle_type_code_3           <chr> NA, "PASSENGER VEHICLE", NA, NA,...
## $ cross_street_name             <chr> NA, NA, NA, NA, "PARKING LOT OF ...
## $ contributing_factor_vehicle_4 <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ vehicle_type_code_4           <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ contributing_factor_vehicle_5 <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ vehicle_type_code_5           <chr> NA, NA, NA, NA, NA, NA, NA, NA, ...

The vehicle collision data was collected by the New York Police Department and published by NYC OpenData.

Source: https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95

The motor vehicle collision dataset includes the date and time, location (as borough, street names, zip code and latitude and longitude coordinates), injuries and fatalities, vehicle number and types, and related factors for almost 1.2 million collisions in New York City during 5 years (from 1st July 2012 until 31st December 2017).

Each observation contains is related to 1 collision.

Data Cleansing

Aim of this section is to improve quality of data. Columns will be renamed, data-types need to be checked and values of variables must be checked.

Renaming Variables

data=dataNYC
names(data) <- gsub("number_of_", "", names(data), fixed = TRUE)
names(data) <- gsub("_", ".", names(data), fixed = TRUE)
names(data) <- gsub("code1", "CODE.1", names(data), fixed = TRUE)
names(data) <- gsub("code2", "CODE.2", names(data), fixed = TRUE)
names(data) <- gsub("code3", "CODE.3", names(data), fixed = TRUE)
colnames(data)[c(20, 23, 7, 6)] <-c("UNIQUE.KEY", "ZIP.CODE", "LOCATION.COORDINATES", "LOCATION.TYPE")
names(data) <- toupper(names(data))
colnames(data)
##  [1] "BOROUGH"                       "CONTRIBUTING.FACTOR.VEHICLE.1"
##  [3] "CONTRIBUTING.FACTOR.VEHICLE.2" "DATE"                         
##  [5] "LATITUDE"                      "LOCATION.TYPE"                
##  [7] "LOCATION.COORDINATES"          "LONGITUDE"                    
##  [9] "CYCLIST.INJURED"               "CYCLIST.KILLED"               
## [11] "MOTORIST.INJURED"              "MOTORIST.KILLED"              
## [13] "PEDESTRIANS.INJURED"           "PEDESTRIANS.KILLED"           
## [15] "PERSONS.INJURED"               "PERSONS.KILLED"               
## [17] "OFF.STREET.NAME"               "ON.STREET.NAME"               
## [19] "TIME"                          "UNIQUE.KEY"                   
## [21] "VEHICLE.TYPE.CODE.1"           "VEHICLE.TYPE.CODE.2"          
## [23] "ZIP.CODE"                      "CONTRIBUTING.FACTOR.VEHICLE.3"
## [25] "VEHICLE.TYPE.CODE.3"           "CROSS.STREET.NAME"            
## [27] "CONTRIBUTING.FACTOR.VEHICLE.4" "VEHICLE.TYPE.CODE.4"          
## [29] "CONTRIBUTING.FACTOR.VEHICLE.5" "VEHICLE.TYPE.CODE.5"

Dropping Variables

4 variables will be dropped. Dataframe will have less columns, what will accelerate calculations on computer.

length(unique(data$UNIQUE.KEY)) == nrow(data)
## [1] TRUE
data<-subset(data, select=-c(UNIQUE.KEY, ZIP.CODE, LOCATION.COORDINATES, LOCATION.TYPE))
colnames(data)
##  [1] "BOROUGH"                       "CONTRIBUTING.FACTOR.VEHICLE.1"
##  [3] "CONTRIBUTING.FACTOR.VEHICLE.2" "DATE"                         
##  [5] "LATITUDE"                      "LONGITUDE"                    
##  [7] "CYCLIST.INJURED"               "CYCLIST.KILLED"               
##  [9] "MOTORIST.INJURED"              "MOTORIST.KILLED"              
## [11] "PEDESTRIANS.INJURED"           "PEDESTRIANS.KILLED"           
## [13] "PERSONS.INJURED"               "PERSONS.KILLED"               
## [15] "OFF.STREET.NAME"               "ON.STREET.NAME"               
## [17] "TIME"                          "VEHICLE.TYPE.CODE.1"          
## [19] "VEHICLE.TYPE.CODE.2"           "CONTRIBUTING.FACTOR.VEHICLE.3"
## [21] "VEHICLE.TYPE.CODE.3"           "CROSS.STREET.NAME"            
## [23] "CONTRIBUTING.FACTOR.VEHICLE.4" "VEHICLE.TYPE.CODE.4"          
## [25] "CONTRIBUTING.FACTOR.VEHICLE.5" "VEHICLE.TYPE.CODE.5"

Data Types

Conversion of variables into correct data types.

data$BOROUGH<-as.factor(data$BOROUGH)
data$TIME<-as.POSIXct(data$TIME,format="%H:%M")
data$TIME<-format(data$TIME, format = "%H:%M")
data$LATITUDE<-as.numeric(data$LATITUDE)
data$LONGITUDE<-as.numeric(data$LONGITUDE)
data$YEAR<-year(data$DATE)
data$MONTH<-month(data$DATE)
data$WEEKDAY<-weekdays(data$DATE)
data$HOUR<-format(as.POSIXct(data$TIME,format="%H:%M"),"%H")

col_names_vect<-colnames(data)

factor_vehicle<-col_names_vect[grepl("FACTOR", col_names_vect)]
vehicle_type<-col_names_vect[grepl("TYPE", col_names_vect)]
street<-vehicles<-col_names_vect[grepl("STREET", col_names_vect)]
injured<-vehicles<-col_names_vect[grepl("INJURED", col_names_vect)]
killed<-vehicles<-col_names_vect[grepl("KILLED", col_names_vect)]

#create vectors with factor and numeric column names
factor_col<-c(factor_vehicle, vehicle_type, street)
numeric_col<-c(injured,killed)

#change type 
data[factor_col] <- lapply(data[factor_col], as.factor)
data[numeric_col] <- lapply(data[numeric_col], as.integer)

str(data)
## 'data.frame':    1185165 obs. of  30 variables:
##  $ BOROUGH                      : Factor w/ 5 levels "BRONX","BROOKLYN",..: 4 NA 3 4 NA 4 4 3 2 2 ...
##  $ CONTRIBUTING.FACTOR.VEHICLE.1: Factor w/ 49 levels "Accelerator Defective",..: 15 9 47 47 47 9 18 9 14 14 ...
##  $ CONTRIBUTING.FACTOR.VEHICLE.2: Factor w/ 48 levels "Accelerator Defective",..: 46 46 46 46 46 46 NA 9 NA 46 ...
##  $ DATE                         : POSIXct, format: "2012-10-21" "2012-10-21" ...
##  $ LATITUDE                     : num  40.7 40.8 40.8 40.8 NA ...
##  $ LONGITUDE                    : num  -73.8 -73.9 -74 -73.9 NA ...
##  $ CYCLIST.INJURED              : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ CYCLIST.KILLED               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ MOTORIST.INJURED             : int  2 3 0 0 0 0 0 0 0 0 ...
##  $ MOTORIST.KILLED              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PEDESTRIANS.INJURED          : int  0 0 0 0 0 0 2 0 1 0 ...
##  $ PEDESTRIANS.KILLED           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PERSONS.INJURED              : int  2 3 0 0 0 1 2 0 1 0 ...
##  $ PERSONS.KILLED               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ OFF.STREET.NAME              : Factor w/ 16165 levels "                                ",..: 152 NA 15537 11037 NA 8226 1742 4415 5986 291 ...
##  $ ON.STREET.NAME               : Factor w/ 10180 levels "\177estfarms road                  ",..: 8727 NA 2464 7017 NA 2099 5176 9395 1737 1276 ...
##  $ TIME                         : chr  "11:30" "11:30" "11:50" "11:50" ...
##  $ VEHICLE.TYPE.CODE.1          : Factor w/ 180 levels "315 e","3D","4DSD",..: 132 96 96 96 96 18 132 132 132 96 ...
##  $ VEHICLE.TYPE.CODE.2          : Factor w/ 199 levels "15 PA","1S","3D",..: 148 113 113 113 113 113 NA 148 NA 113 ...
##  $ CONTRIBUTING.FACTOR.VEHICLE.3: Factor w/ 43 levels "Accelerator Defective",..: NA 42 NA NA NA NA NA NA NA NA ...
##  $ VEHICLE.TYPE.CODE.3          : Factor w/ 48 levels "3D","4DSD","AM",..: NA 25 NA NA NA NA NA NA NA NA ...
##  $ CROSS.STREET.NAME            : Factor w/ 90293 levels "                                        ",..: NA NA NA NA 87938 NA NA NA NA NA ...
##  $ CONTRIBUTING.FACTOR.VEHICLE.4: Factor w/ 42 levels "Accelerator Defective",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ VEHICLE.TYPE.CODE.4          : Factor w/ 35 levels "4DSD","AMBULANCE",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ CONTRIBUTING.FACTOR.VEHICLE.5: Factor w/ 32 levels "Aggressive Driving/Road Rage",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ VEHICLE.TYPE.CODE.5          : Factor w/ 23 levels "AMBULANCE","BICYCLE",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ YEAR                         : num  2012 2012 2012 2012 2012 ...
##  $ MONTH                        : num  10 10 10 10 10 10 10 10 10 10 ...
##  $ WEEKDAY                      : chr  "Sunday" "Sunday" "Sunday" "Sunday" ...
##  $ HOUR                         : chr  "11" "11" "11" "11" ...

Correct range of values

Let’s check if in our dataset has unusual values.

summary(data)
##           BOROUGH                      CONTRIBUTING.FACTOR.VEHICLE.1
##  BRONX        :113118   Unspecified                   :572508       
##  BROOKLYN     :261275   Driver Inattention/Distraction:173802       
##  MANHATTAN    :214354   Failure to Yield Right-of-Way : 56449       
##  QUEENS       :221896   Fatigued/Drowsy               : 48482       
##  STATEN ISLAND: 39090   Backing Unsafely              : 37359       
##  NA's         :335432   (Other)                       :291185       
##                         NA's                          :  5380       
##                 CONTRIBUTING.FACTOR.VEHICLE.2
##  Unspecified                   :869962       
##  Driver Inattention/Distraction: 48076       
##  Other Vehicular               : 20532       
##  Fatigued/Drowsy               : 13027       
##  Failure to Yield Right-of-Way : 10676       
##  (Other)                       : 62735       
##  NA's                          :160157       
##       DATE                        LATITUDE        LONGITUDE      
##  Min.   :2012-07-01 00:00:00   Min.   : 0.00    Min.   :-201.36  
##  1st Qu.:2013-12-17 00:00:00   1st Qu.:40.67    1st Qu.: -73.98  
##  Median :2015-05-26 00:00:00   Median :40.72    Median : -73.93  
##  Mean   :2015-05-03 11:08:16   Mean   :40.72    Mean   : -73.92  
##  3rd Qu.:2016-09-17 00:00:00   3rd Qu.:40.77    3rd Qu.: -73.87  
##  Max.   :2017-12-31 00:00:00   Max.   :41.13    Max.   :   0.00  
##                                NA's   :213471   NA's   :213471   
##  CYCLIST.INJURED   CYCLIST.KILLED     MOTORIST.INJURED  MOTORIST.KILLED   
##  Min.   :0.00000   Min.   :0.00e+00   Min.   : 0.0000   Min.   :0.000000  
##  1st Qu.:0.00000   1st Qu.:0.00e+00   1st Qu.: 0.0000   1st Qu.:0.000000  
##  Median :0.00000   Median :0.00e+00   Median : 0.0000   Median :0.000000  
##  Mean   :0.02054   Mean   :8.44e-05   Mean   : 0.1851   Mean   :0.000457  
##  3rd Qu.:0.00000   3rd Qu.:0.00e+00   3rd Qu.: 0.0000   3rd Qu.:0.000000  
##  Max.   :4.00000   Max.   :2.00e+00   Max.   :43.0000   Max.   :5.000000  
##                                                                           
##  PEDESTRIANS.INJURED PEDESTRIANS.KILLED PERSONS.INJURED  
##  Min.   : 0.00000    Min.   :0.000000   Min.   : 0.0000  
##  1st Qu.: 0.00000    1st Qu.:0.000000   1st Qu.: 0.0000  
##  Median : 0.00000    Median :0.000000   Median : 0.0000  
##  Mean   : 0.05195    Mean   :0.000663   Mean   : 0.2566  
##  3rd Qu.: 0.00000    3rd Qu.:0.000000   3rd Qu.: 0.0000  
##  Max.   :27.00000    Max.   :6.000000   Max.   :43.0000  
##                                                          
##  PERSONS.KILLED                             OFF.STREET.NAME  
##  Min.   :0.000000                                   : 33212  
##  1st Qu.:0.000000   3 AVENUE                        : 10493  
##  Median :0.000000   BROADWAY                        : 10296  
##  Mean   :0.001206   2 AVENUE                        :  8952  
##  3rd Qu.:0.000000   5 AVENUE                        :  7444  
##  Max.   :8.000000   (Other)                         :827699  
##                     NA's                            :287069  
##                           ON.STREET.NAME       TIME          
##                                  : 25765   Length:1185165    
##  BROADWAY                        : 12390   Class :character  
##  ATLANTIC AVENUE                 : 10877   Mode  :character  
##  NORTHERN BOULEVARD              :  8622                     
##  3 AVENUE                        :  8434                     
##  (Other)                         :888594                     
##  NA's                            :230483                     
##                     VEHICLE.TYPE.CODE.1
##  PASSENGER VEHICLE            :679753  
##  SPORT UTILITY / STATION WAGON:286044  
##  TAXI                         : 46364  
##  VAN                          : 26470  
##  OTHER                        : 23972  
##  (Other)                      :113667  
##  NA's                         :  8895  
##                     VEHICLE.TYPE.CODE.2
##  PASSENGER VEHICLE            :511567  
##  SPORT UTILITY / STATION WAGON:217064  
##  UNKNOWN                      : 81453  
##  TAXI                         : 38724  
##  BICYCLE                      : 25874  
##  (Other)                      :131502  
##  NA's                         :178981  
##                 CONTRIBUTING.FACTOR.VEHICLE.3
##  Unspecified                   :  70611      
##  Other Vehicular               :   1410      
##  Driver Inattention/Distraction:   1268      
##  Fatigued/Drowsy               :   1122      
##  Following Too Closely         :    416      
##  (Other)                       :   2174      
##  NA's                          :1108164      
##                     VEHICLE.TYPE.CODE.3 
##  PASSENGER VEHICLE            :  61597  
##  SPORT UTILITY / STATION WAGON:  31371  
##  UNKNOWN                      :   3285  
##  TAXI                         :   3087  
##  PICK-UP TRUCK                :   2170  
##  (Other)                      :   5526  
##  NA's                         :1078129  
##                                 CROSS.STREET.NAME  
##                                          :  64094  
##  PARKING LOT 110-00 ROCKAWAY BOULEVARD   :    150  
##  772       EDGEWATER ROAD                :    131  
##  110-00    ROCKAWAY BOULEVARD            :     95  
##  PARKING LOT-772 EDGEWATER RD            :     91  
##  (Other)                                 : 118341  
##  NA's                                    :1002263  
##                 CONTRIBUTING.FACTOR.VEHICLE.4
##  Unspecified                   :  15322      
##  Other Vehicular               :    253      
##  Fatigued/Drowsy               :    222      
##  Driver Inattention/Distraction:    204      
##  Following Too Closely         :     93      
##  (Other)                       :    490      
##  NA's                          :1168581      
##                     VEHICLE.TYPE.CODE.4 
##  PASSENGER VEHICLE            :  24309  
##  SPORT UTILITY / STATION WAGON:  14002  
##  TAXI                         :   1666  
##  PICK-UP TRUCK                :   1267  
##  BICYCLE                      :   1059  
##  (Other)                      :   1670  
##  NA's                         :1141192  
##                 CONTRIBUTING.FACTOR.VEHICLE.5
##  Unspecified                   :   3805      
##  Other Vehicular               :     59      
##  Fatigued/Drowsy               :     48      
##  Driver Inattention/Distraction:     39      
##  Pavement Slippery             :     24      
##  (Other)                       :    114      
##  NA's                          :1181076      
##                     VEHICLE.TYPE.CODE.5       YEAR          MONTH       
##  PASSENGER VEHICLE            :   5358   Min.   :2012   Min.   : 1.000  
##  SPORT UTILITY / STATION WAGON:   3034   1st Qu.:2013   1st Qu.: 4.000  
##  TAXI                         :    240   Median :2015   Median : 7.000  
##  PICK-UP TRUCK                :    206   Mean   :2015   Mean   : 6.872  
##  UNKNOWN                      :     94   3rd Qu.:2016   3rd Qu.:10.000  
##  (Other)                      :    254   Max.   :2017   Max.   :12.000  
##  NA's                         :1175979                                  
##    WEEKDAY              HOUR          
##  Length:1185165     Length:1185165    
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
## 

From this summarision, we can notice that we have plenty of missing values, however these will be analyzed later. LONGITUDE variable has several values -200, which are our of the range - Longitude for NYC is around -74.

Missing Values

As we can see on above summary - there are plenty of missing values in this dataset. Visualization below represents amount of missing values in every column.

missing_plot<-data %>% summarize_all(funs(sum(is.na(.))/length(.))) %>% gather %>% ggplot(aes(x = reorder(key, value), y = value)) + geom_bar(stat = "identity", fill = "blue") + coord_flip() + xlab("Variables") + ylab("Missing values percentage")
ggplotly(missing_plot, tooltip=c("y"))

We can notice that over 90% of data is missing for following columns:

VEHICLE.TYPE.CODE.3

VEHICLE.TYPE.CODE.4

VEHICLE.TYPE.CODE.5

CONTRIBUTING.FACTOR.VEHICLE.3

CONTRIBUTING.FACTOR.VEHICLE.4

CONTRIBUTING.FACTOR.VEHICLE.5

This shows that less than 10% of collisions in NYC contain more than 2 vehicles. These 6 columns will be dropped. Moreover, We can notice that columns such as LATITUDE and LONGITUDE have the same amount of missing value with other columns.

missing_data<-data[is.na(data$LATITUDE),]
print(paste0("There are ",nrow(missing_data)," missing observations"))
## [1] "There are 213471 missing observations"
colMeans(is.na(missing_data))
##                       BOROUGH CONTRIBUTING.FACTOR.VEHICLE.1 
##                             1                             1 
## CONTRIBUTING.FACTOR.VEHICLE.2                          DATE 
##                             1                             1 
##                      LATITUDE                     LONGITUDE 
##                             1                             1 
##               CYCLIST.INJURED                CYCLIST.KILLED 
##                             1                             1 
##              MOTORIST.INJURED               MOTORIST.KILLED 
##                             1                             1 
##           PEDESTRIANS.INJURED            PEDESTRIANS.KILLED 
##                             1                             1 
##               PERSONS.INJURED                PERSONS.KILLED 
##                             1                             1 
##               OFF.STREET.NAME                ON.STREET.NAME 
##                             1                             1 
##                          TIME           VEHICLE.TYPE.CODE.1 
##                             1                             1 
##           VEHICLE.TYPE.CODE.2 CONTRIBUTING.FACTOR.VEHICLE.3 
##                             1                             1 
##           VEHICLE.TYPE.CODE.3             CROSS.STREET.NAME 
##                             1                             1 
## CONTRIBUTING.FACTOR.VEHICLE.4           VEHICLE.TYPE.CODE.4 
##                             1                             1 
## CONTRIBUTING.FACTOR.VEHICLE.5           VEHICLE.TYPE.CODE.5 
##                             1                             1 
##                          YEAR                         MONTH 
##                             1                             1 
##                       WEEKDAY                          HOUR 
##                             1                             1
data<-data[!is.na(data$LATITUDE),]
missing_plot2<-data %>% summarize_all(funs(sum(is.na(.))/length(.))) %>% gather %>% ggplot(aes(x = reorder(key, value), y = value)) + geom_bar(stat = "identity", fill = "green") + coord_flip() + xlab("Variables") + ylab("Missing values percentage")
missing_plot2

col_names_vect<-colnames(data)

multicars1<-col_names_vect[grepl("VEHICLE.3", col_names_vect)]
multicars2<-col_names_vect[grepl("VEHICLE.4", col_names_vect)]
multicars3<-col_names_vect[grepl("VEHICLE.5", col_names_vect)]
multicars4<-col_names_vect[grepl("CODE.3", col_names_vect)]
multicars5<-col_names_vect[grepl("CODE.4", col_names_vect)]
multicars6<-col_names_vect[grepl("CODE.5", col_names_vect)]
latitude<-col_names_vect[grepl("LATITUDE", col_names_vect)]
longitude<-col_names_vect[grepl("LONGITUDE", col_names_vect)]

data_copy=data

multi_collision_with_location<-c(multicars1, multicars2, multicars3, multicars4, multicars5, multicars6, latitude, longitude)

multi_collision_without_location<-multi_collision_with_location [! multi_collision_with_location %in% c(latitude, longitude)]

data_multi_collision_with_location<-data[multi_collision_with_location]

data <- data[, !colnames(data) %in% multi_collision_without_location]

Highly - Correlated Variables

num <- data %>% select(which(sapply(data, is.numeric)))
num <- num[complete.cases(num),]
corrplot(cor(num), method="circle")

Correlation plot shows that PERSONS.INJURED and PERSONS.KILLED have a strong positive correlation with columns such as MOTORIST.INJURED or PEDESTRAINS.INJURED.

There are 2 insights to notice.

  1. Insight about deaths: mostly pedestrians were killed, then motorists.
  2. In case of injury: majority of injuries belongs to motorists.

Time Analysis

Collisions by Year

yearDistribution<-ggplot(data, aes(YEAR)) +   
  geom_bar(aes(fill = BOROUGH), position = "dodge", stat="count") +scale_fill_brewer(palette = "Set2") + xlab("Year") + ylab("Amount of collisions") + ggtitle("Number of Collisions Per Year for each Borough")
yearDistribution  #+ geom_text(stat='count', aes(label=..count..), vjust=-1)

#ggplotly(yearDistribution, tooltip = c("y"))

2012 contains less collisions, because we retrieved data from 1st of July.

Collisions by Month

monthDistribution<-ggplot(data,aes(MONTH)) + geom_bar(colour='red', fill='blue', stat = "count") + ggtitle('Number of accidents in dataset for each Borough') + scale_x_discrete(limits = month.abb) + xlab("Month") + ylab("Amount of collisions") + scale_fill_brewer(palette = "Set1") +facet_wrap(~BOROUGH) + theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplotly(monthDistribution, tooltip = c("y")) 

Heatmap Month/Year

df_heatmap<-data[,c("YEAR", "MONTH", "TIME")]
df_heatmap <- aggregate(TIME ~ YEAR + MONTH, data = df_heatmap, FUN = length)
names(df_heatmap) <- c("YEAR", "MONTH", "COUNT")
df_heatmap$MONTH <- as.factor(month.abb[df_heatmap$MONTH])
hm<-create_heatmap(df_heatmap)
hm + geom_text(aes(label=COUNT))

ggplotly(hm, tooltip = c("COUNT"))

Deaths

df_heatmap_kill<-data[,c("YEAR", "MONTH", "PERSONS.KILLED")]
PERSONS.KILLED<-data$PERSONS.KILLED
df_heatmap_kill<-aggregate_heatmap(df_heatmap_kill, PERSONS.KILLED)
hm_kill<-create_heatmap(df_heatmap_kill)
hm_kill + geom_text(aes(label=COUNT))

#print(paste0("There have been ",sum(PERSONS.KILLED)," death accidents since 01/07/2012 until 31/12/2017")))

Injuries

df_heatmap_injured <- data[,c("YEAR", "MONTH", "PERSONS.INJURED")]
PERSONS.INJURED <- df_heatmap_injured$PERSONS.INJURED
df_heatmap_injured <- aggregate_heatmap(df_heatmap_injured, PERSONS.INJURED)
hm_injured <- create_heatmap(df_heatmap_injured)
hm_injured + geom_text(aes(label=COUNT))

#print(paste0("There have been ",sum(data$PERSONS.INJURED)," death accidents since 01/07/2012 until 31/12/2017")))

Collisions by Weekday

weekdayPlot<-ggplot(data,aes(WEEKDAY)) + geom_bar(colour="black", fill="purple") + ggtitle("Accidents by weekday") +ylab('Accidents') + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + scale_x_discrete(labels = paste0(weekdays(Sys.Date()+4:10)))

weekdayPlot

#ggplotly(weekdayPlot)
weekdayPlot + facet_wrap(~BOROUGH)

Collisions per Hour

plotHour<-ggplot(data) + geom_line(aes(HOUR),group = 1,stat = "count", color = "steelblue") + xlab("Hour") 
plotHour

plotHour + facet_wrap(~BOROUGH) 

plotHour + facet_wrap(~YEAR)

plotHour + facet_wrap(~WEEKDAY)

Location Analysis

library('ggmap')
map <- get_map("new york", zoom = 10)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=new+york&zoom=10&size=640x640&scale=2&maptype=terrain&language=en-EN&sensor=false
## Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=new%20york&sensor=false
ggmap(map) + geom_point(data = data, aes(x = LONGITUDE, y = LATITUDE, color=BOROUGH))

Injuries Location

df_injured<-data[,c("LATITUDE","LONGITUDE","PEDESTRIANS.INJURED","CYCLIST.INJURED","MOTORIST.INJURED")] %>% gather(TYPE,VALUE,3:5) %>% na.omit() %>% group_by(LATITUDE,LONGITUDE,TYPE) %>% summarise(TOTAL=sum(VALUE,na.rm=T)) %>% filter(TOTAL!=0)

Pedestrians

 type<-"PEDESTRIANS.INJURED"
 pedestrianInjured <-geo_plot(map,df_injured,type)
 pedestrianInjured

Over 10 pedestrians Injured

filteredData<-df_injured %>% filter(TOTAL >10)

 type<-"PEDESTRIANS.INJURED"
 pedestrianInjured1 <-geo_plot_2(map,filteredData,type)
 pedestrianInjured1

Cyclists

 type<-"CYCLIST.INJURED"
 cyclistInjured <-geo_plot(map,df_injured,type)
 cyclistInjured

Over 15 cyclists injured

filteredData<-df_injured %>% filter(TOTAL > 15)


 type<-"CYCLIST.INJURED"
 cyclistInjured1 <-geo_plot_2(map,filteredData,type)
 cyclistInjured1

Motorists

 type<-"MOTORIST.INJURED"
 motoristInjured <-geo_plot(map,df_injured,type)
 motoristInjured
## Warning: Removed 1 rows containing missing values (geom_point).

Over 120 motorists injured

filteredData<-df_injured %>% filter(TOTAL > 120)


 type<-"MOTORIST.INJURED"
 motoristInjured1 <-geo_plot_2(map,filteredData,type)
 motoristInjured1

Deaths Location

df_killed <- data[,c("LATITUDE","LONGITUDE","PEDESTRIANS.KILLED","CYCLIST.KILLED","MOTORIST.KILLED")] %>% gather(TYPE,VALUE,3:5) %>% na.omit() %>% group_by(LATITUDE,LONGITUDE,TYPE) %>% summarise(TOTAL=sum(VALUE,na.rm=T)) %>% filter(TOTAL!=0)

Pedestrians

 type<-"PEDESTRIANS.KILLED"
 pedestrianKilled <-geo_plot(map,df_killed,type)
 pedestrianKilled

Over 4 pedestrians killed

filteredData<-df_killed %>% filter(TOTAL > 3)

 type<-"PEDESTRIANS.KILLED"
 pedestrianKilled1 <-geo_plot_2(map,filteredData,type)
 pedestrianKilled1

Cyclists

 type<-"CYCLIST.KILLED"
 cyclistKilled <-geo_plot(map,df_killed,type)
 cyclistKilled

2 cyclists killed

filteredData<-df_killed %>% filter(TOTAL > 1)
  
 type<-"CYCLIST.KILLED"
 cyclistKilled1 <-geo_plot_2(map,filteredData,type)
 cyclistKilled1

 cyclistKilled1 

Motorists

 type<-"MOTORIST.KILLED"
 motoristKilled <-geo_plot(map,df_killed,type)
 motoristKilled

Over 3 motorists killed

filteredData<-df_killed %>% filter(TOTAL > 3)

 type<-"MOTORIST.KILLED"
 motoristKilled <-geo_plot_2(map,filteredData,type)
 motoristKilled

Contributing Factors Analysis

factorData <- data_copy %>% select(PERSONS.KILLED,PERSONS.INJURED,CONTRIBUTING.FACTOR.VEHICLE.1,CONTRIBUTING.FACTOR.VEHICLE.2,CONTRIBUTING.FACTOR.VEHICLE.3,CONTRIBUTING.FACTOR.VEHICLE.4,CONTRIBUTING.FACTOR.VEHICLE.5) %>% gather(TYPE,VALUE,1:2) %>% gather(VEHICLE_TYPE,CAUSE,1:5) %>% filter(VALUE!=0,CAUSE!="",CAUSE!="Unspecified")
## Warning: attributes are not identical across measure variables;
## they will be dropped
factorData1 <- factorData %>% select(-VEHICLE_TYPE) %>% group_by(TYPE,CAUSE) %>% summarise(TOTAL=sum(VALUE,na.rm=T))


cause_plot(factorData1)

Unsafe speed locations

cause <- "Unsafe Speed"
causeLocation<- create_cause_dataset(data_copy,cause)
df1<-causeLocation[,c("LATITUDE", "LONGITUDE", "BOROUGH")]
ggmap(map) + geom_point(data = df1, aes(x = LONGITUDE, y = LATITUDE, color=BOROUGH), size = 3, shape = 1)

Further Analysis Suggestions

Clustering

locationDF<-data[,c("LATITUDE","LONGITUDE")]
locationDF<-locationDF[complete.cases(locationDF),]
km<-kmeans(locationDF, 5)
locationDF$CLUSTER<-as.factor(km$cluster)

ggmap(map) + geom_point(data = locationDF, aes(x = LONGITUDE, y = LATITUDE, color=CLUSTER), size = 3, shape = 1)
## Warning: Removed 8 rows containing missing values (geom_point).

num1<-subset(num, select=-c(PERSONS.KILLED, PERSONS.INJURED))
km<-kmeans(num1, 3)
num1$CLUSTER<-as.factor(km$cluster)

ggmap(map) + geom_point(data = num1, aes(x = LONGITUDE, y = LATITUDE, color=CLUSTER), size = 3, shape = 1)
## Warning: Removed 8 rows containing missing values (geom_point).

Summary of Analysis

This comprehensive analysis provided answers for questions below:

1. When collisions happen the most frequently? Is it correlated with specific months or weekdays?

Collisions happen mostly frequently in the middle of the week.

Time: there is a high amount of collisions at 8AM and 4PM. The biggest amount of collisions happen, when people come back from work. However, we can notice that amount of collisions in night increases during weekend.

2. Where collisions appear the most frequently? Which boroughs are the most dangerous?

The biggest amount of accidents we can notice in downtown of Manhattan and the smallest amount - in Staten Island.

3. What are the main factors that contribute collisions in NYC?

Main factors are: Distraction, Failure to Yield Right-of-Way and Traffic Control Disregarded.